80 research outputs found

    A Fast Quartet Tree Heuristic for Hierarchical Clustering

    The Minimum Quartet Tree Cost problem is to construct an optimal weight tree from the $3{n \choose 4}$ weighted quartet topologies on $n$ objects, where optimality means that the summed weight of the embedded quartet topologies is optimal (so it can be the case that the optimal tree embeds all quartets as nonoptimal topologies). We present a Monte Carlo heuristic, based on randomized hill climbing, for approximating the optimal weight tree, given the quartet topology weights. The method repeatedly transforms a dendrogram, with all objects involved as leaves, achieving a monotonic approximation to the exact single globally optimal tree. The problem and the solution heuristic have been extensively used for general hierarchical clustering of nontree-like (non-phylogeny) data in various domains and across domains with heterogeneous data. We also present a greatly improved heuristic, reducing the running time by a factor of order a thousand to ten thousand. All this is implemented and available as part of the CompLearn package. We compare performance and running time of the original and improved versions with those of UPGMA, BioNJ, and NJ, as implemented in the SplitsTree package, on genomic data for which the latter are optimized. Keywords: Data and knowledge visualization, Pattern matching--Clustering--Algorithms/Similarity measures, Hierarchical clustering, Global optimization, Quartet tree, Randomized hill-climbing. Comment: LaTeX, 40 pages, 11 figures; this paper has substantial overlap with arXiv:cs/0606048 in cs.D
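
As a rough illustration of where the count of $3{n \choose 4}$ quartet topologies comes from: every set of four objects can be split into two pairs in exactly three ways. The sketch below only enumerates those topologies; it is not the paper's heuristic, and the function name `quartet_topologies` and the sample labels are hypothetical.

```python
from itertools import combinations
from math import comb


def quartet_topologies(objects):
    """Yield each quartet topology as ((a, b), (c, d)), read as ab|cd.

    Every 4-subset {a, b, c, d} admits exactly three pairings:
    ab|cd, ac|bd and ad|bc.
    """
    for a, b, c, d in combinations(objects, 4):
        yield ((a, b), (c, d))
        yield ((a, c), (b, d))
        yield ((a, d), (b, c))


# Hypothetical sample labels, chosen only for illustration.
labels = ["u", "v", "w", "x", "y"]
topologies = list(quartet_topologies(labels))

# The number of topologies should match 3 * C(n, 4).
expected_count = 3 * comb(len(labels), 4)
print(len(topologies), expected_count)
```

A tree-search heuristic such as the one described would assign a weight to each of these topologies and score a candidate tree by the summed weight of the topologies it embeds.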

    Normalized Web Distance and Word Similarity

    There is a great deal of work in cognitive psychology, linguistics, and computer science, about using word (or phrase) frequencies in context in text corpora to develop measures for word similarity or word association, going back to at least the 1960s. The goal of this chapter is to introduce the normalized web distance (NWD) method to determine similarity between words and phrases. It is a general way to tap the amorphous low-grade knowledge available for free on the Internet, typed in by local users aiming at personal gratification of diverse objectives, and yet globally achieving what is effectively the largest semantic electronic database in the world. Moreover, this database is available for all by using any search engine that can return aggregate page-count estimates for a large range of search-queries. In the paper introducing the NWD it was called `normalized Google distance (NGD),' but since Google doesn't allow computer searches anymore, we opt for the more neutral and descriptive NWD. Comment: Latex, 20 pages, 7 figures, to appear in: Handbook of Natural Language Processing, Second Edition, Nitin Indurkhya and Fred J. Damerau Eds., CRC Press, Taylor and Francis Group, Boca Raton, FL, 2010, ISBN 978-142008592
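
The abstract does not state the formula, but the distance is usually written in terms of page counts as $(\max\{\log f(x), \log f(y)\} - \log f(x,y)) / (\log N - \min\{\log f(x), \log f(y)\})$, where $f$ is a page count and $N$ the total number of indexed pages. A minimal sketch under that assumption follows; the function name `nwd`, its parameters, and the sample counts are hypothetical, and returning infinity for zero counts is a convention of this sketch.

```python
from math import log


def nwd(fx, fy, fxy, n):
    """Normalized web distance from aggregate page counts.

    fx, fy : page counts for each term alone
    fxy    : page count for pages containing both terms
    n      : assumed total number of indexed pages (a normalizing constant)
    """
    if min(fx, fy, fxy) <= 0:
        # Sketch convention: no co-occurrence evidence means maximal distance.
        return float("inf")
    log_fx, log_fy, log_fxy = log(fx), log(fy), log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (log(n) - min(log_fx, log_fy))


# Made-up page counts, for illustration only.
always_together = nwd(1000, 1000, 1000, 10**6)
rarely_together = nwd(1000, 1000, 10, 10**6)
print(always_together, rarely_together)
```

Terms that always co-occur get a distance of zero, and the distance grows as the joint count shrinks relative to the individual counts.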

    Normalized Information Distance

    The normalized information distance is a universal distance measure for objects of all kinds. It is based on Kolmogorov complexity and thus uncomputable, but there are ways to utilize it. First, compression algorithms can be used to approximate the Kolmogorov complexity if the objects have a string representation. Second, for names and abstract concepts, page count statistics from the World Wide Web can be used. These practical realizations of the normalized information distance can then be applied to machine learning tasks, especially clustering, to perform feature-free and parameter-free data mining. This chapter discusses the theoretical foundations of the normalized information distance and both practical realizations. It presents numerous examples of successful real-world applications based on these distance measures, ranging from bioinformatics to music clustering to machine translation. Comment: 33 pages, 12 figures, pdf, in: Normalized information distance, in: Information Theory and Statistical Learning, Eds. M. Dehmer, F. Emmert-Streib, Springer-Verlag, New-York, To appea
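
The compression-based realization mentioned above is commonly written as the normalized compression distance, $(C(xy) - \min\{C(x), C(y)\}) / \max\{C(x), C(y)\}$, where $C$ is the compressed length. The sketch below assumes that form and uses zlib as a stand-in compressor, which is a choice made here and not something the abstract specifies; the names `compressed_size` and `ncd` and the sample data are hypothetical.

```python
import hashlib
import zlib


def compressed_size(data: bytes) -> int:
    """Approximate Kolmogorov complexity by the zlib-compressed length."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Illustrative inputs: two similar texts and one block of hash output,
# which is effectively incompressible and unrelated to either text.
text_a = b"the quick brown fox jumps over the lazy dog " * 20
text_b = b"the quick brown fox leaps over the lazy cat " * 20
noise = b"".join(hashlib.sha256(bytes([i])).digest() for i in range(32))

print(ncd(text_a, text_b), ncd(text_a, noise))
```

With a real compressor the value is only an approximation, so it need not be exactly zero for identical inputs and can slightly exceed one for unrelated inputs.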

    Sensitivity to inflectional morphemes in the absence of meaning: evidence from a novel task

    A number of studies in different languages have shown that speakers may be sensitive to the presence of inflectional morphology in the absence of verb meaning (Caramazza et al., 1988, Clahsen, 1999, Post et al., 2008). In this study, sensitivity to inflectional morphemes was tested in a purposely developed task with English-like nonwords. Native speakers of English were presented with pairs of nonwords and were asked to judge whether the two nonwords in each pair were the same or different. Each pair was composed either of the same nonword repeated twice, or of two slightly different nonwords. The nonwords were created by taking advantage of a specific morphophonological property of English, namely that regular inflectional morphemes agree in voicing with the ending of the stem. Using stems ending in /l/, we thus created: 1. nonwords ending in potential inflectional morphemes, vɔld, 2. nonwords without inflectional morphemes, vɔlt, and 3. a phonological control condition, vɔlb. Our new task retains some strengths of previous work. As in Post et al. (2008), the task accounts for the importance of phonological cues to morphological processing. In addition, as in Caramazza et al. (1988) and contrary to Post et al. (2008), the task never presents bare stems, making it unlikely that the participants would be aware of the manipulation performed. Our results are in line with Caramazza et al. (1988), Clahsen (1999) and Post et al. (2008), and offer further evidence that morphologically inflected nonwords take longer to be discriminated than uninflected nonwords.

    Effect of heuristics on serendipity in path-based storytelling with linked data

    Path-based storytelling with Linked Data on the Web provides users the ability to discover concepts in an entertaining and educational way. Given a query context, many state-of-the-art pathfinding approaches aim at telling a story that coincides with the user's expectations by investigating paths over Linked Data on the Web. By taking into account serendipity in storytelling, we aim at improving and tailoring existing approaches towards better fitting user expectations so that users are able to discover interesting knowledge without feeling unsure or even lost in the story facts. To this end, we propose to optimize the estimation of links between facts, and the selection of facts, in a story by increasing the consistency and relevancy of links between facts through additional domain delineation and refinement steps. In order to address multiple aspects of serendipity, we propose and investigate combinations of weights and heuristics in paths forming the essential building blocks for each story. Our experimental findings with stories based on DBpedia indicate improvements when applying the optimized algorithm.

    Satellites Form Fast & Late: a Population Synthesis for the Galilean Moons

    The satellites of Jupiter are thought to form in a circumplanetary disc. Here we address their formation and orbital evolution with a population synthesis approach, by varying the dust-to-gas ratio, the disc dispersal timescale and the dust refilling timescale. The circumplanetary disc initial conditions (density and temperature) are directly drawn from the results of 3D radiative hydrodynamical simulations. The disc evolution is taken into account within the population synthesis. The satellitesimals were assumed to grow via streaming instability. We find that the moons form fast, often within $10^4$ years, due to the short orbital timescales in the circumplanetary disc. They form in sequence, and many are lost into the planet due to fast type I migration, polluting Jupiter's envelope with typically 15 Earth-masses of metals. The last generation of moons can form very late in the evolution of the giant planet, when the disc has already lost more than 99% of its mass. The late circumplanetary disc is cold enough to sustain water ice, hence, not surprisingly, 85% of the moon population has an icy composition. The distribution of the satellite masses peaks slightly above Galilean masses, up to a few Earth-masses, in a regime which is observable with the current instrumentation around Jupiter-analog exoplanets orbiting sufficiently close to their host stars. We also find that systems with Galilean-like masses occur in 20% of the cases and they are more likely when discs have long dispersion timescales and high dust-to-gas ratios. Comment: 15 pages, 17 figures. Accepted by MNRAS, please check the final published versio

    Evaluation of Fused Pyrrolothiazole Systems as Correctors of Mutant CFTR Protein

    Cystic fibrosis (CF) is a genetic disease caused by mutations that impair the function of the CFTR chloride channel. The most frequent mutation, F508del, causes misfolding and premature degradation of CFTR protein. This defect can be overcome with pharmacological agents named "correctors". So far, at least three different classes of correctors have been identified based on the additive/synergistic effects that are obtained when compounds of different classes are combined. The development of class 2 correctors has lagged behind that of compounds belonging to the other classes. It was shown that the efficacy of the prototypical class 2 corrector, the bithiazole corr-4a, could be improved by generating conformationally-locked bithiazoles. In the present study, we investigated the effect of tricyclic pyrrolothiazoles as analogues of constrained bithiazoles. Thirty-five compounds were tested using the functional assay based on the halide-sensitive yellow fluorescent protein (HS-YFP) that measured CFTR activity. One compound, having a six-atom carbocycle central ring in the tricyclic pyrrolothiazole system and bearing a pivalamide group at the thiazole moiety and a 5-chloro-2-methoxyphenyl carboxamide at the pyrrole ring, significantly increased F508del-CFTR activity. This compound could lead to the synthesis of a novel class of CFTR correctors.

    Semantic disambiguation and contextualisation of social tags

    The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-642-28509-7_18 This manuscript is an extended version of the paper ‘cTag: Semantic Contextualisation of Social Tags’, presented at the 6th International Workshop on Semantic Adaptive Social Web (SASWeb 2011). We present an algorithmic framework to accurately and efficiently identify the semantic meanings and contexts of social tags within a particular folksonomy. The framework is used for building contextualised tag-based user and item profiles. We also present its implementation in a system called cTag, with which we conduct a preliminary analysis of the semantic meanings and contexts of tags belonging to the Delicious and MovieLens folksonomies. The analysis includes a comparison between semantic similarities obtained for pairs of tags in the Delicious folksonomy and their semantic distances in the whole Web, according to co-occurrence based metrics computed with results of a Web search engine. This work was supported by the Spanish Ministry of Science and Innovation (TIN2008-06566-C04-02), and Universidad Autónoma de Madrid (CCG10-UAM/TIC-5877